Group-1 : US Car Accidents Analysis¶


Project Members: 1) Ajinkya Desai, 2) Akash Bharsakle, 3) Asawari Kadam, 4) Prachi Kotkar

In [1]:
from IPython.core.interactiveshell import InteractiveShell 
InteractiveShell.ast_node_interactivity = "all"

%autosave 0

import pandas as pd
import numpy as np
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn as sk
import sklearn.tree as tree
from IPython.display import Image  
import pydotplus
from sklearn.cluster import KMeans


import warnings
# suppress all warnings (including FutureWarnings)
warnings.filterwarnings('ignore')
Autosave disabled
In [2]:
df = pd.read_csv('US_Accidents_March23.csv',index_col=0, parse_dates=True)

pd.set_option('display.float_format', lambda x: '%.2f' % x)

pd.set_option('display.max_rows', 100)
pd.set_option('display.min_rows', 100)

1. Data Set Description¶

This dataset covers car accidents across 49 US states (all except Alaska), spanning February 2016 to March 2023, and contains approximately 7.7 million accident records.

We have divided the important columns into 4 categories, as follows:¶

General Info

  • Severity : Shows the severity of the accident, where 1 indicates the least impact on traffic and 4 indicates the most.
  • Start_Time : Start time of the accident in local time zone.
  • End_Time : End time of the accident in local time zone.
  • Start_Lat : Latitude of the start point in GPS coordinates.
  • Start_Lng : Longitude of the start point in GPS coordinates.
  • Distance : The length of the road extent affected by the accident, in miles.
  • Street : Name of the street the accident happened on.
  • City : Shows the city in the address field.
  • State : Shows the state in the address field.
  • Zipcode : Shows the zipcode in the address field.
  • Timezone : Shows timezone based on the location of the accident (eastern, central, etc.).

Weather Situations

  • Temperature : Shows the temperature (in Fahrenheit).
  • Wind_Chill : Shows the wind chill (in Fahrenheit).
  • Humidity : Shows the humidity (in percentage).
  • Pressure : Shows the air pressure (in inches).
  • Visibility : Shows visibility (in miles).
  • Wind_Speed : Shows wind speed (in miles per hour).
  • Weather_Condition : Shows the weather condition (rain, snow, thunderstorm, fog, etc.)

Road Conditions

  • Amenity : Indicates presence of an amenity in a nearby location.
  • Bump : Indicates presence of a speed bump or hump in a nearby location.
  • Crossing : Indicates presence of a crossing in a nearby location.
  • Give_Way : Indicates presence of a give-way sign in a nearby location.
  • Junction : Indicates presence of a junction in a nearby location.
  • No_Exit : Indicates no exit in a nearby location.
  • Railway : Indicates presence of a railway route in a nearby location.
  • Roundabout : Indicates presence of a roundabout in a nearby location.
  • Station : Indicates presence of a station in a nearby location.
  • Stop : Indicates presence of a stop sign in a nearby location.
  • Traffic_Calming : Indicates whether traffic calming measures are present near the accident location.
  • Traffic_Signal : Indicates presence of a traffic signal in a nearby location.
  • Turning_Loop : Indicates presence of a turning loop in a nearby location.

Period of the Day

  • Sunrise_Sunset : Shows the period of day (i.e. day or night) based on sunrise/sunset.
  • Civil_Twilight : Shows the period of day (i.e. day or night) based on civil twilight.
  • Nautical_Twilight : Shows the period of day (i.e. day or night) based on nautical twilight.
  • Astronomical_Twilight : Shows the period of day (i.e. day or night) based on Astronomical Twilight.
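For reference, the four groups above can be kept as plain Python lists (a small organizational sketch, using the bracket-free names adopted during cleaning), so that each block of features can be selected with `df[cols]`:

```python
# Column groups as described above (names as they appear after the unit
# suffixes such as '(F)' and '(mi)' are stripped during cleaning)
general_info = ['Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
                'Distance', 'Street', 'City', 'State', 'Zipcode', 'Timezone']
weather_cols = ['Temperature', 'Wind_Chill', 'Humidity', 'Pressure',
                'Visibility', 'Wind_Speed', 'Weather_Condition']
road_cols = ['Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit',
             'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming',
             'Traffic_Signal', 'Turning_Loop']
period_cols = ['Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
               'Astronomical_Twilight']

all_groups = general_info + weather_cols + road_cols + period_cols
print(len(all_groups))  # 35 columns in total, matching the cleaned DataFrame
```

For example, `df[weather_cols].describe()` would then summarize just the weather features.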
In [3]:
df.shape
Out[3]:
(7728394, 45)
In [4]:
df.columns.values
Out[4]:
array(['Source', 'Severity', 'Start_Time', 'End_Time', 'Start_Lat',
       'Start_Lng', 'End_Lat', 'End_Lng', 'Distance(mi)', 'Description',
       'Street', 'City', 'County', 'State', 'Zipcode', 'Country',
       'Timezone', 'Airport_Code', 'Weather_Timestamp', 'Temperature(F)',
       'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)',
       'Wind_Direction', 'Wind_Speed(mph)', 'Precipitation(in)',
       'Weather_Condition', 'Amenity', 'Bump', 'Crossing', 'Give_Way',
       'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop',
       'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
       'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'], dtype=object)
In [5]:
df.reset_index(inplace=True)

Cleaning the Dataset¶

In [6]:
df.head()
Out[6]:
ID Source Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) ... Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
0 A-1 Source2 3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.87 -84.06 NaN NaN 0.01 ... False False False False False False Night Night Night Night
1 A-2 Source2 2 2016-02-08 06:07:59 2016-02-08 06:37:59 39.93 -82.83 NaN NaN 0.01 ... False False False False False False Night Night Night Day
2 A-3 Source2 2 2016-02-08 06:49:27 2016-02-08 07:19:27 39.06 -84.03 NaN NaN 0.01 ... False False False False True False Night Night Day Day
3 A-4 Source2 3 2016-02-08 07:23:34 2016-02-08 07:53:34 39.75 -84.21 NaN NaN 0.01 ... False False False False False False Night Day Day Day
4 A-5 Source2 2 2016-02-08 07:39:07 2016-02-08 08:09:07 39.63 -84.19 NaN NaN 0.01 ... False False False False True False Day Day Day Day

5 rows × 46 columns

Replacing placeholder missing values ('?' and blank strings) with NaN

In [7]:
df.replace(to_replace=['?',' '],value=np.nan,inplace=True)

Checking how many NaNs actually exist

In [8]:
df.isna().sum()
Out[8]:
ID                             0
Source                         0
Severity                       0
Start_Time                     0
End_Time                       0
Start_Lat                      0
Start_Lng                      0
End_Lat                  3402762
End_Lng                  3402762
Distance(mi)                   0
Description                    5
Street                     10869
City                         253
County                         0
State                          0
Zipcode                     1915
Country                        0
Timezone                    7808
Airport_Code               22635
Weather_Timestamp         120228
Temperature(F)            163853
Wind_Chill(F)            1999019
Humidity(%)               174144
Pressure(in)              140679
Visibility(mi)            177098
Wind_Direction            175206
Wind_Speed(mph)           571233
Precipitation(in)        2203586
Weather_Condition         173459
Amenity                        0
Bump                           0
Crossing                       0
Give_Way                       0
Junction                       0
No_Exit                        0
Railway                        0
Roundabout                     0
Station                        0
Stop                           0
Traffic_Calming                0
Traffic_Signal                 0
Turning_Loop                   0
Sunrise_Sunset             23246
Civil_Twilight             23246
Nautical_Twilight          23246
Astronomical_Twilight      23246
dtype: int64

A few columns, such as 'Country' and 'Turning_Loop', have only one unique value

In [9]:
categorical_col = ['Country','Timezone','Bump','Crossing','Junction','No_Exit','Railway','Roundabout','Station',\
             'Stop','Traffic_Signal', 'Turning_Loop','Sunrise_Sunset']
for i in categorical_col:
    print(i,df[i].unique().size)
Country 1
Timezone 5
Bump 2
Crossing 2
Junction 2
No_Exit 2
Railway 2
Roundabout 2
Station 2
Stop 2
Traffic_Signal 2
Turning_Loop 1
Sunrise_Sunset 3

Dropping the columns below: some of them ('ID', 'Description', 'Wind_Direction', 'End_Lng', 'End_Lat', 'Source', 'County', 'Airport_Code', 'Precipitation(in)') do not contribute much insight, 'Country' has only one unique value, and 'Weather_Timestamp' holds nearly the same time as 'Start_Time'.

In [10]:
df.drop(columns=['ID', 'Description', 'Wind_Direction', 'End_Lng', 'End_Lat', 'Weather_Timestamp',
                 'Country', 'Source', 'County', 'Airport_Code', 'Precipitation(in)'], inplace=True)

Dropping duplicate rows from the dataset

In [11]:
df.drop_duplicates(inplace=True)

Renaming the columns for easier access in calculations (removing the unit brackets from the column names)

In [12]:
df.rename(columns={'Distance(mi)':'Distance','Temperature(F)':'Temperature','Wind_Chill(F)':'Wind_Chill',\
                  'Humidity(%)':'Humidity','Pressure(in)':'Pressure','Visibility(mi)':'Visibility',\
                   'Wind_Speed(mph)':'Wind_Speed'},inplace=True)

Dropping rows with NaNs in columns where the number of NaNs is insignificant.

In [13]:
df.dropna(how='any',
          subset=['City', 'Street', 'Zipcode', 'Timezone',
                  'Civil_Twilight', 'Astronomical_Twilight', 'Nautical_Twilight'],
          inplace=True)

Filling NaNs with the mean or median of the respective column, since these columns contain a significant number of NaNs.

In [14]:
df['Temperature'].fillna(df['Temperature'].mean(), inplace=True)
df['Wind_Chill'].fillna(df['Wind_Chill'].mean(), inplace=True)
df['Humidity'].fillna(df['Humidity'].mean(), inplace=True)
df['Pressure'].fillna(df['Pressure'].mean(), inplace=True)
df['Visibility'].fillna(df['Visibility'].median(), inplace=True)
df['Wind_Speed'].fillna(df['Wind_Speed'].median(), inplace=True)

Filling NaNs using the forward-fill method

In [15]:
df['Weather_Condition'].fillna(method='ffill',inplace=True)

Formatting date-time

In [16]:
df.Start_Time = pd.to_datetime(df.Start_Time)
df.End_Time = pd.to_datetime(df.End_Time)

Checking the number of NaNs present after the data cleaning

In [17]:
df.isna().sum().sum()
Out[17]:
0

Final Shape of the Data Frame

In [18]:
df.shape
Out[18]:
(7566065, 35)
In [19]:
df.columns.values
Out[19]:
array(['Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
       'Distance', 'Street', 'City', 'State', 'Zipcode', 'Timezone',
       'Temperature', 'Wind_Chill', 'Humidity', 'Pressure', 'Visibility',
       'Wind_Speed', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout',
       'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
       'Turning_Loop', 'Sunrise_Sunset', 'Civil_Twilight',
       'Nautical_Twilight', 'Astronomical_Twilight'], dtype=object)
In [20]:
df.describe()
Out[20]:
Severity Start_Lat Start_Lng Distance Temperature Wind_Chill Humidity Pressure Visibility Wind_Speed
count 7566065.00 7566065.00 7566065.00 7566065.00 7566065.00 7566065.00 7566065.00 7566065.00 7566065.00 7566065.00
mean 2.21 36.19 -94.74 0.56 61.74 58.31 64.81 29.54 9.11 7.64
std 0.49 5.07 17.39 1.77 18.82 19.27 22.57 0.99 2.66 5.22
min 1.00 24.55 -124.62 0.00 -89.00 -89.00 1.00 0.00 0.00 0.00
25% 2.00 33.39 -117.23 0.00 50.00 52.00 49.00 29.39 10.00 5.00
50% 2.00 35.80 -87.83 0.03 63.00 58.31 66.00 29.85 10.00 7.00
75% 2.00 40.07 -80.37 0.45 76.00 71.00 84.00 30.03 10.00 10.00
max 4.00 49.00 -67.11 441.75 207.00 207.00 100.00 58.63 140.00 1087.00

The bar chart below gives a brief idea of the accident counts by month of the year

In [21]:
month_df = pd.DataFrame(df.Start_Time.dt.month.value_counts()).reset_index()
month_df.sort_values(by='index')
Out[21]:
index Start_Time
1 1 740919
4 2 649718
10 3 545929
7 4 574685
9 5 548259
8 6 561513
11 7 506141
6 8 592922
5 9 638268
3 10 656248
2 11 733412
0 12 818051
In [22]:
month = month_df.rename(columns={'Start_Time':'count','index':'month'}).sort_values(by='month', ascending=True)
In [23]:
# Highlighting the months of interest (only the months with the min and max counts) here

sns.catplot(x='month',y='count',data=month,kind='bar',\
            palette = ["lightpink", "lightpink","lightpink","lightpink","lightpink","lightpink","blue",\
                      "lightpink", "lightpink","lightpink","lightpink","red"])
Out[23]:
[Bar chart: accident counts by month — December (highest, in red) and July (lowest, in blue) highlighted]

1) Analysis of Accident Counts by Zone and Road Parameters¶

In [24]:
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(nrows=4, ncols=2, figsize = (16,20))

road_conditions = ['Bump', 'Crossing', 'Give_Way', 'Junction', 'Stop', 'No_Exit', 'Traffic_Signal', 'Turning_Loop']
colors = [('#6662b3', '#00FF00'), ('#7881ff', '#0e1ce8'), ('#18f2c7', '#09ad8c'), ('#08ff83', '#02a352'), ('#ffcf87', '#f5ab3d'),
         ('#f5f53d', '#949410'), ('#ff9187', '#ffc7c2'), ('tomato', '#008000')]    
count = 0

def func(pct, allvals):
    absolute = int(round(pct/100*np.sum(allvals), 2))
    return "{:.2f}%\n({:,d} Cases)".format(pct, absolute)    

for i in [ax1, ax2, ax3, ax4, ax5, ax6, ax7, ax8]:
    
    size = list(df[road_conditions[count]].value_counts())
    if len(size) != 2:
        size.append(0)
    
    labels = ['False', 'True']
    
    
    
    i.pie(size, labels = labels, colors = colors[count],
                    autopct = lambda pct: func(pct, size), labeldistance=1.1,
                    textprops={'fontsize': 12}, explode=[0, 0.2])

    title = '\nPresence of {}'.format(road_conditions[count])

    i.set_title(title, fontsize = 18, color='grey')
    
    count += 1
Out[24]:
[Pie charts: share of accidents with each road feature present —
 Bump 0.05% (3,446 cases), Crossing 11.35% (858,825), Give_Way 0.47% (35,716),
 Junction 7.35% (556,027), Stop 2.78% (210,081), No_Exit 0.25% (19,217),
 Traffic_Signal 14.84% (1,122,477), Turning_Loop 0.00% (0)]
In [25]:
df_zone = df.copy()
In [26]:
df_zone.columns
Out[26]:
Index(['Severity', 'Start_Time', 'End_Time', 'Start_Lat', 'Start_Lng',
       'Distance', 'Street', 'City', 'State', 'Zipcode', 'Timezone',
       'Temperature', 'Wind_Chill', 'Humidity', 'Pressure', 'Visibility',
       'Wind_Speed', 'Weather_Condition', 'Amenity', 'Bump', 'Crossing',
       'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station',
       'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop',
       'Sunrise_Sunset', 'Civil_Twilight', 'Nautical_Twilight',
       'Astronomical_Twilight'],
      dtype='object')
In [27]:
Northwest =['OR','WA','ID']
Southwest =['CA','NV','AZ','UT']
North_Central = ['MT','WY', 'CO','ND', 'SD', 'NE', 'KS', 'MN', 'IA', 'MO']
South_Central = ['NM','TX','OK','LA']
Midwest =['WI', 'IL','IN', 'OH','MI']
Southeast = ['AR','TN', 'MS', 'AL', 'FL', 'GA', 'SC', 'NC']
Northeast = [ 'DC','KY', 'VA', 'WV', 'MD', 'DE', 'PA', 'NJ', 'NY', 'CT', 'RI', 'MA', 'NH', 'VT', 'ME']               
                 

df_zone['Zone'] = np.select(
    [df_zone['State'].isin(Northwest), df_zone['State'].isin(Southwest), df_zone['State'].isin(North_Central), df_zone['State'].isin(South_Central),df_zone['State'].isin(Midwest),df_zone['State'].isin(Southeast),df_zone['State'].isin(Northeast)],
    ['Northwest', 'Southwest', 'North Central', 'South Central','Midwest','Southeast','Northeast'],
    default='Unknown'
)
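The `np.select` call above can be sanity-checked on a tiny synthetic frame (a hypothetical example with abbreviated zone lists); a state that appears in no list falls through to the `'Unknown'` default:

```python
import numpy as np
import pandas as pd

# 'AK' (Alaska) deliberately appears in no zone list
demo = pd.DataFrame({'State': ['CA', 'TX', 'NY', 'AK']})
demo['Zone'] = np.select(
    [demo['State'].isin(['CA', 'NV', 'AZ', 'UT']),   # Southwest (as above)
     demo['State'].isin(['NM', 'TX', 'OK', 'LA']),   # South Central (as above)
     demo['State'].isin(['NY', 'MA', 'PA'])],        # Northeast (abbreviated)
    ['Southwest', 'South Central', 'Northeast'],
    default='Unknown'
)
print(demo['Zone'].tolist())  # ['Southwest', 'South Central', 'Northeast', 'Unknown']
```

The first matching condition wins, so overlapping state lists would silently resolve in list order.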
In [28]:
plt.figure(figsize=(12, 8))

# Grouping by Zone and calculating the counts for each parameter
grouped_df = df_zone.groupby('Zone')[['Traffic_Signal', 'Crossing', 'Junction']].sum().reset_index()

# Melting the DataFrame for better visualization
melted_df = pd.melt(grouped_df, id_vars='Zone', var_name=' Road Parameters', value_name='Count')

# Plotting the bar graph
ax = sns.barplot(x='Zone', y='Count', hue=' Road Parameters', data=melted_df)
plt.title('Accidents Count by Zone and Road Parameters')
plt.xlabel('Zone')
plt.ylabel('Count')

# Calculating and displaying the percentages on each bar
# (note: every bar, including Crossing and Junction, is expressed as a
#  percentage of the total Traffic_Signal count used as the denominator)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height, f'{height / grouped_df["Traffic_Signal"].sum() * 100:.2f}%', 
            ha="center", va="bottom")

plt.show()
Out[28]:
[Grouped bar chart 'Accidents Count by Zone and Road Parameters'. Bar labels, each as a
 percentage of the Traffic_Signal total, in zone order Midwest, North Central, Northeast,
 Northwest, South Central, Southeast, Southwest —
 Traffic_Signal: 9.36%, 3.96%, 16.96%, 2.59%, 18.32%, 31.27%, 17.55%;
 Crossing: 5.95%, 3.09%, 11.30%, 3.42%, 9.97%, 28.70%, 14.09%;
 Junction: 3.50%, 3.73%, 11.62%, 1.36%, 3.53%, 8.50%, 17.30%]
In [29]:
## Decision Tree
In [30]:
df2 = df_zone.copy()
zone_mapping = {'Northwest':1, 'Southwest':2, 'North Central':3, 'South Central':4,'Midwest':5,'Southeast':6,\
                'Northeast':7}
df2['Zone'] = df2['Zone'].map(zone_mapping)
In [31]:
df2.drop(columns=['Start_Time','Start_Lat','Start_Lng','Street','City','State','Zipcode',\
                  'Wind_Speed','Distance','Railway','No_Exit','Timezone','Temperature','Visibility',\
                  'Weather_Condition','Civil_Twilight','Nautical_Twilight', 'Astronomical_Twilight','End_Time',\
                  'Wind_Chill','Humidity','Pressure','Turning_Loop'],inplace= True)
In [32]:
df2.columns
Out[32]:
Index(['Severity', 'Amenity', 'Bump', 'Crossing', 'Give_Way', 'Junction',
       'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal',
       'Sunrise_Sunset', 'Zone'],
      dtype='object')
In [33]:
# Converting the boolean road-feature columns to floats for the classifier
bool_cols = ['Bump', 'Crossing', 'Give_Way', 'Junction', 'Roundabout',
             'Station', 'Stop', 'Traffic_Signal', 'Traffic_Calming']
df2[bool_cols] = df2[bool_cols].astype(float)
In [34]:
df2['Sunrise_Sunset'] = df2['Sunrise_Sunset'].map({'Day': 1, 'Night': 0})
df2['Sunrise_Sunset'].fillna(df2['Sunrise_Sunset'].mean(), inplace=True)
In [35]:
dt= tree.DecisionTreeClassifier(max_depth=1)
In [36]:
X = df2.drop('Zone',axis=1)
Y = df2.Zone
In [37]:
dt.fit(X,Y)
Out[37]:
DecisionTreeClassifier(max_depth=1)
In [38]:
dt_feature_names = list(X.columns)
dt_target_names = [str(s) for s in Y.unique()]
tree.export_graphviz(dt, out_file='tree.dot', 
    feature_names=dt_feature_names, class_names=dt_target_names,
    filled=True)  
graph = pydotplus.graph_from_dot_file('tree.dot')
Image(graph.create_png())
Out[38]:
[Decision tree visualization rendered from 'tree.dot']
In [39]:
# Selecting rows for Zone 7 and Zone 4 with Traffic_Signal equal to 1 for validation of the decision tree
selected_zones_traffic_signal_1 = df2[(df2['Zone'].isin([7, 4])) & (df2['Traffic_Signal'] == 1)]

# Plotting
plt.figure(figsize=(10, 6))
sns.countplot(data=selected_zones_traffic_signal_1, x='Zone')

plt.title('Count of Accidents with Traffic Signal = 1 for Zone 7 and Zone 4')
plt.xlabel('Zone')
plt.ylabel('Count')
plt.show()
Out[39]:
[Bar chart: count of accidents with Traffic_Signal = 1 for Zone 7 (Northeast) and Zone 4 (South Central)]

Insight

  • Surprisingly, almost 34% of accidents occurred even though a traffic signal, crossing, or junction was present at the location: (Traffic_Signal) 14.84% + (Crossing) 11.35% + (Junction) 7.35% = 33.54%.

  • On a detailed analysis of these factors, dividing the US states into 7 zones based on geographical conditions, it is quite evident from the graph that 68.47% of accidents in the Southeast region occurred even though these 3 factors (traffic signal, crossing, and junction) were present there: (Traffic_Signal) 31.27% + (Crossing) 28.70% + (Junction) 8.50% = 68.47%.
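Treating the three feature percentages as additive, the way the two bullets above do, the arithmetic checks out (a trivial verification sketch; the numbers are taken from the chart labels above):

```python
# Percentages read off the road-feature pie charts (overall) and the
# zone bar chart (Southeast)
overall = {'Traffic_Signal': 14.84, 'Crossing': 11.35, 'Junction': 7.35}
southeast = {'Traffic_Signal': 31.27, 'Crossing': 28.70, 'Junction': 8.50}

print(round(sum(overall.values()), 2))    # 33.54
print(round(sum(southeast.values()), 2))  # 68.47
```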

Recommendation:

The government should levy heavier penalties for traffic-rule violations (especially in the Southeast region) so that people drive more cautiously and the accident count decreases.

2) Accident Analysis Based on Road Bumps (Speed Breakers)¶

In [40]:
severity_counts = df['Severity'].value_counts()
severity_counts

# Most accidents are of severity '2' and '3', 
# with very few cases of severity '1' and '4'.
Out[40]:
2    6011746
3    1291377
4     197935
1      65007
Name: Severity, dtype: int64
In [41]:
pie_chart = df.groupby('Severity')['Severity'].count().\
            plot(kind='pie',figsize=(6, 6),autopct='%1.0f%%',cmap="Blues")

labels = severity_counts.index.tolist()
# Adding a title to the pie chart
pie_chart.set_title('Distribution of Accident Severity', fontsize=15)
Out[41]:
[Pie chart: Distribution of Accident Severity]
In [42]:
# 5. Traffic Management and Infrastructure Improvement
# Analyzing accidents based on city or county
citywise_accidents = df['City'].value_counts().head(10)  # Top 10 cities
citywise_accidents
# Citywise distribution plot
sns.barplot(y=citywise_accidents.index, x=citywise_accidents.values)
plt.title('Top 10 Cities with Most Accidents')
plt.xlabel('Number of Accidents')
plt.ylabel('City')
# plt.text(1, max(citywise_accidents.values) * 1, 'Miami and LA identify areas with high accident frequencies to improve road safety measures', fontsize=6, color='black')
plt.show()
Out[42]:
Miami          183485
Houston        168242
Los Angeles    154732
Charlotte      136731
Dallas         129743
Orlando        108517
Austin          96411
Raleigh         85057
Nashville       72210
Baton Rouge     70682
Name: City, dtype: int64
Out[42]:
[Bar chart: Top 10 Cities with Most Accidents]

Hotspot Analysis: Major cities like Miami, Houston, and Los Angeles stand out as areas with high accident frequencies

In [43]:
# Group by 'City' and count the number of accidents
city_accidents = df.groupby('City').size().reset_index(name='Accident_Count')

# 'Start_Lat' and 'Start_Lng' give the coordinates for the accidents
cities_df = df[['City', 'Start_Lat', 'Start_Lng']].drop_duplicates(subset=['City'])

# Merge the city locations with the accident counts
plotting_df = pd.merge(cities_df, city_accidents, on='City')

# Set your Mapbox access token here
px.set_mapbox_access_token('pk.eyJ1IjoiYWppbmt5YWRlc2FpIiwiYSI6ImNscGd6OHJlbzAyOXoyanJ4a3E5eHM1Y3kifQ.ZzUvGP5rqkSuSmsfAYZ3HA')

# Create a scatter mapbox to visualize accidents by city
fig = px.scatter_mapbox(plotting_df,
                        lat='Start_Lat',
                        lon='Start_Lng',
                        size='Accident_Count',
                        color='Accident_Count',
                        color_continuous_scale=px.colors.sequential.Tealgrn,
                        size_max=20,
                        zoom=3,
                        hover_name='City',
#                         height=1200,  # You may adjust this value as needed
#                         width=800, 
                        title='Accidents in the United States by City',
                        mapbox_style='light')


# Show the plot
fig.show()
In [44]:
state_severity_counts = df.groupby(['State','Severity']).size().reset_index(name='Accident_Count')

# First, sort the DataFrame by 'State' and 'Accident_Count' in descending order
state_severity_counts_sorted = state_severity_counts.sort_values(by=['State', 'Accident_Count'], ascending=[True, False])

# Now, drop duplicate states, keeping the first occurrence (which, due to the
# sorting, is the most frequent severity level for that state)
highest_severity_per_state = state_severity_counts_sorted.drop_duplicates(subset='State')


highest_severity_per_state_sorted = highest_severity_per_state.sort_values(by='Accident_Count', ascending=False)

len(highest_severity_per_state_sorted)
# Print the resulting DataFrame
highest_severity_per_state_sorted.head(3)
Out[44]:
49
Out[44]:
State Severity Accident_Count
13 CA 2 1419150
33 FL 2 737886
163 TX 2 445580
In [45]:
state_accident_counts = df.groupby(['State',]).size().reset_index(name='Accident_Count')
state_accident_counts_sorted = state_accident_counts.sort_values(by='Accident_Count', ascending=False)
state_accident_counts_sorted.head(3)
Out[45]:
State Accident_Count
3 CA 1713431
8 FL 860790
41 TX 576923
In [46]:
fig = px.choropleth(state_accident_counts, 
                    locations="State",
                    locationmode='USA-states',
                    color="Accident_Count", 
                    hover_name="State",
                    scope='usa',
                    title='US Accidents Statewise Count',
                    width=800,
                    height=400,
                    color_continuous_scale=px.colors.sequential.Tealgrn,
                   labels={'Accident_Count': 'Number of Accidents'}
                   )
                    
fig.show()

It seems that most accidents with severity 2 happen in large states like California, Florida, and Texas

In [47]:
correlation_matrix = df.corr()
print(correlation_matrix)
                 Severity  Start_Lat  Start_Lng  Distance  Temperature  \
Severity             1.00       0.07       0.05      0.04        -0.02   
Start_Lat            0.07       1.00      -0.07      0.06        -0.44   
Start_Lng            0.05      -0.07       1.00      0.01        -0.01   
Distance             0.04       0.06       0.01      1.00        -0.05   
Temperature         -0.02      -0.44      -0.01     -0.05         1.00   
Wind_Chill          -0.06      -0.42      -0.03     -0.04         0.90   
Humidity             0.02       0.02       0.18      0.01        -0.33   
Pressure             0.04      -0.19       0.19     -0.09         0.11   
Visibility          -0.00      -0.09      -0.02     -0.04         0.21   
Wind_Speed           0.04       0.03       0.07      0.01         0.03   
Amenity             -0.03       0.02       0.01     -0.03         0.01   
Bump                -0.01      -0.00      -0.01     -0.00         0.00   
Crossing            -0.11      -0.07       0.06     -0.09         0.06   
Give_Way            -0.00       0.00       0.02     -0.01         0.00   
Junction             0.05       0.05      -0.04      0.03        -0.02   
No_Exit             -0.01      -0.01       0.01     -0.01         0.01   
Railway             -0.01       0.00      -0.02     -0.02         0.00   
Roundabout          -0.00      -0.00       0.00     -0.00         0.00   
Station             -0.05      -0.05       0.01     -0.04         0.03   
Stop                -0.05      -0.00      -0.03     -0.03         0.01   
Traffic_Calming     -0.01      -0.00      -0.00     -0.01         0.00   
Traffic_Signal      -0.11      -0.07       0.07     -0.11         0.05   
Turning_Loop          NaN        NaN        NaN       NaN          NaN   

                 Wind_Chill  Humidity  Pressure  Visibility  Wind_Speed  ...  \
Severity              -0.06      0.02      0.04       -0.00        0.04  ...   
Start_Lat             -0.42      0.02     -0.19       -0.09        0.03  ...   
Start_Lng             -0.03      0.18      0.19       -0.02        0.07  ...   
Distance              -0.04      0.01     -0.09       -0.04        0.01  ...   
Temperature            0.90     -0.33      0.11        0.21        0.03  ...   
Wind_Chill             1.00     -0.27      0.08        0.20       -0.04  ...   
Humidity              -0.27      1.00      0.11       -0.38       -0.17  ...   
Pressure               0.08      0.11      1.00        0.04       -0.02  ...   
Visibility             0.20     -0.38      0.04        1.00        0.01  ...   
Wind_Speed            -0.04     -0.17     -0.02        0.01        1.00  ...   
Amenity                0.00     -0.01      0.02        0.01        0.00  ...   
Bump                   0.00     -0.01     -0.01        0.00        0.00  ...   
Crossing               0.05     -0.03      0.03        0.04        0.03  ...   
Give_Way              -0.00      0.00      0.01        0.00        0.00  ...   
Junction              -0.03     -0.00      0.03       -0.01        0.01  ...   
No_Exit                0.01     -0.01     -0.00        0.01        0.00  ...   
Railway                0.00     -0.00      0.01        0.00       -0.00  ...   
Roundabout             0.00      0.00      0.00        0.00        0.00  ...   
Station                0.03     -0.01      0.04        0.02        0.01  ...   
Stop                   0.01     -0.02     -0.01        0.01        0.00  ...   
Traffic_Calming        0.01     -0.00      0.00        0.00        0.00  ...   
Traffic_Signal         0.02     -0.01      0.04        0.03        0.02  ...   
Turning_Loop            NaN       NaN       NaN         NaN         NaN  ...   

                 Give_Way  Junction  No_Exit  Railway  Roundabout  Station  \
Severity            -0.00      0.05    -0.01    -0.01       -0.00    -0.05   
Start_Lat            0.00      0.05    -0.01     0.00       -0.00    -0.05   
Start_Lng            0.02     -0.04     0.01    -0.02        0.00     0.01   
Distance            -0.01      0.03    -0.01    -0.02       -0.00    -0.04   
Temperature          0.00     -0.02     0.01     0.00        0.00     0.03   
Wind_Chill          -0.00     -0.03     0.01     0.00        0.00     0.03   
Humidity             0.00     -0.00    -0.01    -0.00        0.00    -0.01   
Pressure             0.01      0.03    -0.00     0.01        0.00     0.04   
Visibility           0.00     -0.01     0.01     0.00        0.00     0.02   
Wind_Speed           0.00      0.01     0.00    -0.00        0.00     0.01   
Amenity              0.01     -0.03     0.01     0.05        0.00     0.15   
Bump                 0.00     -0.00     0.01     0.00       -0.00     0.00   
Crossing             0.06     -0.09     0.06     0.18        0.00     0.17   
Give_Way             1.00     -0.01     0.01     0.00        0.00    -0.00   
Junction            -0.01      1.00    -0.00    -0.01        0.01    -0.04   
No_Exit              0.01     -0.00     1.00     0.00       -0.00     0.02   
Railway              0.00     -0.01     0.00     1.00       -0.00     0.12   
Roundabout           0.00      0.01    -0.00    -0.00        1.00    -0.00   
Station             -0.00     -0.04     0.02     0.12       -0.00     1.00   
Stop                 0.03     -0.04     0.03     0.01        0.00     0.04   
Traffic_Calming      0.00     -0.00     0.01     0.00        0.00     0.01   
Traffic_Signal       0.07     -0.10     0.03     0.06       -0.00     0.12   
Turning_Loop          NaN       NaN      NaN      NaN         NaN      NaN   

                 Stop  Traffic_Calming  Traffic_Signal  Turning_Loop  
Severity        -0.05            -0.01           -0.11           NaN  
Start_Lat       -0.00            -0.00           -0.07           NaN  
Start_Lng       -0.03            -0.00            0.07           NaN  
Distance        -0.03            -0.01           -0.11           NaN  
Temperature      0.01             0.00            0.05           NaN  
Wind_Chill       0.01             0.01            0.02           NaN  
Humidity        -0.02            -0.00           -0.01           NaN  
Pressure        -0.01             0.00            0.04           NaN  
Visibility       0.01             0.00            0.03           NaN  
Wind_Speed       0.00             0.00            0.02           NaN  
Amenity          0.03             0.02            0.11           NaN  
Bump             0.02             0.68           -0.00           NaN  
Crossing         0.12             0.04            0.48           NaN  
Give_Way         0.03             0.00            0.07           NaN  
Junction        -0.04            -0.00           -0.10           NaN  
No_Exit          0.03             0.01            0.03           NaN  
Railway          0.01             0.00            0.06           NaN  
Roundabout       0.00             0.00           -0.00           NaN  
Station          0.04             0.01            0.12           NaN  
Stop             1.00             0.03           -0.05           NaN  
Traffic_Calming  0.03             1.00            0.01           NaN  
Traffic_Signal  -0.05             0.01            1.00           NaN  
Turning_Loop      NaN              NaN             NaN           NaN  

[23 rows x 23 columns]
In [48]:
# Plotting the heatmap
plt.figure(figsize=(15, 15))  # Making the plot larger
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)

# Improving the aesthetics
plt.title('Correlation Matrix of Accident Data', fontsize=16)
plt.xticks(rotation=45, ha='right', fontsize=10)  # Rotating x labels for better readability
plt.yticks(fontsize=10)
plt.tight_layout()  # Adjusts the plot to ensure everything fits without overlapping

# Display the plot
plt.show()
Out[48]:
[Heatmap figure: 'Correlation Matrix of Accident Data', 23 x 23 variables from Severity through Turning_Loop]
In [49]:
# Top 5 highest correlations
# Flatten the correlation matrix and sort values
corr_pairs = correlation_matrix.unstack().sort_values(ascending=False).drop_duplicates()

# Remove self-correlation pairs
corr_pairs = corr_pairs[corr_pairs.index.get_level_values(0) != corr_pairs.index.get_level_values(1)]

# Get the top 5 highest correlations
highest_corr_pairs = corr_pairs.head(5)
print("Top 5 highest correlations:")
print(highest_corr_pairs)
Top 5 highest correlations:
Wind_Chill      Temperature       0.90
Bump            Traffic_Calming   0.68
Traffic_Signal  Crossing          0.48
Visibility      Temperature       0.21
Wind_Chill      Visibility        0.20
dtype: float64
In [50]:
# Extracting the names of the columns involved in the highest correlations
highest_corr_columns = set()
for (col1, col2) in highest_corr_pairs.index:
    highest_corr_columns.add(col1)
    highest_corr_columns.add(col2)

# Creating a new DataFrame with the highest correlated columns
highest_corr_df = correlation_matrix.loc[highest_corr_columns, highest_corr_columns]

# Plotting a heatmap for the highest correlations
plt.figure(figsize=(10, 8))
sns.heatmap(highest_corr_df, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.xticks(rotation=45, ha='right', fontsize=10) 
plt.title('Heatmap of Highest Correlations')
plt.show()
Out[50]:
[Heatmap figure: 'Heatmap of Highest Correlations' over Traffic_Calming, Visibility, Temperature, Crossing, Traffic_Signal, Bump, Wind_Chill]

Temperature and Wind Chill: There is a high positive correlation (0.90) between temperature and wind chill, which is expected since wind chill is derived from temperature and wind speed. There is, however, a counterintuitive finding on accidents vs. weather: most accidents happen at moderate temperatures around 50 °F (rather than in extreme weather), concentrated in densely populated cities and states.

Traffic Calming and Bump: The Traffic_Calming flag has a strong positive correlation (0.68) with the Bump flag, indicating that speed bumps are a commonly used traffic-calming measure.
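A caveat on the pair-extraction code above: calling `drop_duplicates()` on the unstacked matrix can silently drop two *distinct* column pairs that happen to share the same correlation value. A more robust sketch (illustrative only, using a hypothetical 3x3 matrix rather than the notebook's data) masks everything but the strict upper triangle first, so each unordered pair appears exactly once and equal values are preserved:

```python
import numpy as np
import pandas as pd

# Small illustrative correlation matrix (hypothetical values)
corr = pd.DataFrame(
    [[1.0, 0.9, 0.2],
     [0.9, 1.0, 0.2],
     [0.2, 0.2, 1.0]],
    index=['A', 'B', 'C'], columns=['A', 'B', 'C'],
)

# Keep only the strict upper triangle (k=1 excludes the diagonal):
# each pair appears once, and equal-valued pairs A-C and B-C both survive
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
top_pairs = corr.where(mask).unstack().dropna().sort_values(ascending=False)
print(top_pairs.head(5))
```

With the full 23-column matrix, `top_pairs.head(5)` would give the same top pairs as above whenever the values are distinct, but without the risk of losing ties.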

BUMP IMPACT ON ACCIDENTS

In [51]:
Bump_incident_counts = df.groupby('Bump').size().reset_index(name='Number_of_Incidents')
Bump_incident_counts
barplot=sns.barplot(x='Bump', y='Number_of_Incidents', data=Bump_incident_counts)
plt.title('Impact of Bumps on Number of Incidents')
plt.xlabel('Bump Presence')
plt.ylabel('Number of Incidents')
plt.yscale('log')
# Annotate each bar with the value
for index, row in Bump_incident_counts.iterrows():
    barplot.text(index, row.Number_of_Incidents, row.Number_of_Incidents, color='black', ha="center")
plt.show()

# Traffic Calming and Bumps: Bumps are effectively used to calm traffic. 
# The government should assess and possibly increase their use, especially near schools and residential areas.
Out[51]:
Bump Number_of_Incidents
0 False 7562619
1 True 3446
Out[51]:
[Log-scale bar chart: 'Impact of Bumps on Number of Incidents', bars annotated 7562619 (no bump) vs 3446 (bump)]
In [52]:
# Group by 'State' and 'Bump' and count the number of incidents
state_bump_counts = df.groupby(['State', 'Bump']).size().reset_index(name='Number_of_Incidents')
state_bump_counts.sort_values(by=['Number_of_Incidents'],ascending= False).head(10)
Out[52]:
State Bump Number_of_Incidents
4 CA False 1712434
13 FL False 860487
67 TX False 576676
62 SC False 374995
51 NY False 336935
40 NC False 330052
71 VA False 296286
59 PA False 286779
35 MN False 187497
57 OR False 172814
In [53]:
# Group by 'State' and 'Bump' and count the number of incidents
state_bump_counts = df.groupby(['State', 'Bump']).size().reset_index(name='Number_of_Incidents')

# Sort the results by the number of incidents, not state, to get the top incidents
state_bump_counts_sorted = state_bump_counts.sort_values('Number_of_Incidents', ascending=False).head(10)

# Plotting the results
plt.figure(figsize=(14, 10))
barplot = sns.barplot(data=state_bump_counts_sorted, x='State', y='Number_of_Incidents', hue='Bump')
plt.xticks(rotation=90)
plt.title('Statewise Impact of Bumps on Number of Incidents')
plt.xlabel('State')
plt.ylabel('Number of Incidents')
# plt.legend(title='Bump Presence')
plt.legend(title='Bump Presence', loc='upper left', bbox_to_anchor=(0.87, 1))

plt.tight_layout(rect=[0, 0, 0.9, 1])  # Adjust the rect parameter to make space for the legend


# Annotate the top 3 states
top_states = state_bump_counts_sorted.head(3)
for i, (index, row) in enumerate(top_states.iterrows()):
    # Get the x location of the bar
    x = i
    # Annotate with state name and incident count
    plt.text(x, row['Number_of_Incidents'], f"{row['State']}: {int(row['Number_of_Incidents'])}", color='black', ha="left")

# plt.tight_layout()
plt.show()
Out[53]:
[Grouped bar chart: 'Statewise Impact of Bumps on Number of Incidents' for CA, FL, TX, SC, NY, NC, VA, PA, MN, OR; top 3 annotated: CA 1712434, FL 860487, TX 576676]

Comparison of the number of accidents in regions with and without speed bumps

In [54]:
# Filter the data to rows where 'Bump' is False; .copy() avoids pandas'
# SettingWithCopyWarning when new columns are added below
no_bump_data = df[df['Bump'] == False].copy()

# Define regions based on the median values of latitude and longitude
median_lat = no_bump_data['Start_Lat'].median()
median_lng = no_bump_data['Start_Lng'].median()

# Function to determine the region based on latitude and longitude
def determine_region(lat, lng, median_lat, median_lng):
    if lat >= median_lat and lng >= median_lng:
        return 'North-East'
    elif lat < median_lat and lng >= median_lng:
        return 'South-East'
    elif lat >= median_lat and lng < median_lng:
        return 'North-West'
    else:
        return 'South-West'

# Apply the function to create a new 'Region' column
no_bump_data['Region'] = no_bump_data.apply(lambda x: determine_region(x['Start_Lat'], x['Start_Lng'], median_lat, median_lng), axis=1)

# Use KMeans to cluster the data based on 'Latitude' and 'Longitude'
kmeans = KMeans(n_clusters=4, random_state=0)  # 4 clusters to match the regions defined above; fixed seed for reproducibility
no_bump_data['Cluster'] = kmeans.fit_predict(no_bump_data[['Start_Lat', 'Start_Lng']])

# Plotting
plt.figure(figsize=(10, 6))

# We use hue to color the data points based on the new 'Region' column
sns.scatterplot(data=no_bump_data, x='Start_Lng', y='Start_Lat', hue='Region', style='Cluster',
                palette='Set1', alpha=0.6)

# Plot the cluster centers
centers = kmeans.cluster_centers_

# Count the number of accidents per cluster
cluster_accident_counts = no_bump_data.groupby('Cluster').size()

# Annotate the cluster centers with the accident counts
for i, count in enumerate(cluster_accident_counts):
    plt.scatter(centers[i, 1], centers[i, 0], c='black', s=100, alpha=0.75, marker='X')
    plt.text(centers[i, 1], centers[i, 0], str(count), color='black', fontsize=12, ha='left', va='bottom')

plt.title('Accidents Region-wise Clustering Without Bumps')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()


# The geometric center of each cluster is its centroid
Out[54]:
[Scatter plot: 'Accidents Region-wise Clustering Without Bumps', cluster centroids annotated with accident counts: 1504984 at (lat 36.09, lng -94.72), 2317352 at (lat 37.07, lng -118.80), 1880248 at (lat 40.02, lng -77.37), 1860035 at (lat 31.29, lng -82.33)]

The clusters show that the vast majority of accidents occur at locations without speed bumps. The North-East region has a dense concentration of accidents, which likely corresponds to heavily urbanized areas with heavy traffic.
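The row-wise `apply` used above to assign regions scans millions of rows one at a time. The same quadrant logic can be vectorized with `np.select`, which evaluates the conditions column-wise; a sketch on a few hypothetical coordinates (not the accident data itself):

```python
import numpy as np
import pandas as pd

# Hypothetical sample of accident start coordinates
pts = pd.DataFrame({'Start_Lat': [40.0, 30.0, 41.0, 29.0],
                    'Start_Lng': [-75.0, -80.0, -120.0, -118.0]})

median_lat = pts['Start_Lat'].median()
median_lng = pts['Start_Lng'].median()

# Boolean masks computed once for the whole column
north = pts['Start_Lat'] >= median_lat
east = pts['Start_Lng'] >= median_lng

# np.select picks the first matching label per row, with a default
pts['Region'] = np.select(
    [north & east, ~north & east, north & ~east],
    ['North-East', 'South-East', 'North-West'],
    default='South-West',
)
print(pts)
```

On the full ~7.5M-row DataFrame this avoids a Python-level function call per row and is typically orders of magnitude faster.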

In [55]:
# Filter the data to rows where 'Bump' is True; .copy() avoids pandas'
# SettingWithCopyWarning when new columns are added below
bump_data = df[df['Bump'] == True].copy()

# Define regions based on the median values of latitude and longitude
median_lat = bump_data['Start_Lat'].median()
median_lng = bump_data['Start_Lng'].median()

# Function to determine the region based on latitude and longitude
def determine_region(lat, lng, median_lat, median_lng):
    if lat >= median_lat and lng >= median_lng:
        return 'North-East'
    elif lat < median_lat and lng >= median_lng:
        return 'South-East'
    elif lat >= median_lat and lng < median_lng:
        return 'North-West'
    else:
        return 'South-West'

# Apply the function to create a new 'Region' column
bump_data['Region'] = bump_data.apply(lambda x: determine_region(x['Start_Lat'], x['Start_Lng'], median_lat, median_lng), axis=1)

# Use KMeans to cluster the data based on latitude and longitude
kmeans = KMeans(n_clusters=4, random_state=0)  # 4 clusters to match the regions defined above
bump_data['Cluster'] = kmeans.fit_predict(bump_data[['Start_Lat', 'Start_Lng']])

# Plotting
plt.figure(figsize=(10, 6))

# Color the points by 'Region' and vary the marker by 'Cluster'
sns.scatterplot(data=bump_data, x='Start_Lng', y='Start_Lat', hue='Region', style='Cluster',
                palette='Set1', alpha=0.6)

# The geometric center (centroid) of each cluster
centers = kmeans.cluster_centers_

# Count the number of accidents per cluster
cluster_accident_counts = bump_data.groupby('Cluster').size()

# Annotate the cluster centers with the accident counts
for i, count in enumerate(cluster_accident_counts):
    plt.scatter(centers[i, 1], centers[i, 0], c='black', s=100, alpha=0.75, marker='X')
    plt.text(centers[i, 1], centers[i, 0], str(count), color='black', fontsize=12, ha='left', va='bottom')

plt.title('Accidents Region-wise Clustering with Bumps')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend(title='Region', bbox_to_anchor=(1.05, 1), loc='upper left')

plt.show()
Out[55]:
[Scatter plot: 'Accidents Region-wise Clustering with Bumps', cluster centroids annotated with accident counts: 849 at (lat 33.70, lng -110.47), 859 at (lat 34.17, lng -78.52), 1310 at (lat 39.17, lng -121.55), 428 at (lat 32.60, lng -96.09)]

The clusters show far fewer accidents, only a few thousand in total, at locations with speed bumps.
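Here k=4 was fixed to mirror the four regions. When no natural k is given, the elbow method is a common heuristic: inertia (within-cluster sum of squares) drops sharply until k reaches the true number of clusters, then flattens. A sketch on synthetic, hypothetical coordinates (not the accident data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs of (lat, lng) points
coords = np.vstack([
    rng.normal(loc=(34.0, -118.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(40.7, -74.0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(29.8, -95.4), scale=0.5, size=(100, 2)),
])

# Fit KMeans for a range of k and record the inertia of each fit
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(coords).inertia_
            for k in range(1, 7)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting k against inertia shows the "elbow" at k=3 for this data; past the elbow, extra clusters only split genuine groups.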

Summary of Finding: Locations with speed bumps record far lower accident counts, suggesting that bumps may contribute to road safety and accident prevention.

Validity of Finding: The clustering figures show a clear pattern: regions without speed bumps (shown as separate cluster colors) record markedly higher accident counts than regions with them.

Managerial Insights: Get more bumps! Increasing the number of speed bumps in high-accident areas could effectively reduce accident rates. Companies involved in road safety solutions could see increased demand for speed bumps and related traffic-calming products.

3) Analysis of Weather Conditions vs Visibility and Severity Metrics¶

In [56]:
df_11 = df.copy()
In [57]:
# Example bin definitions
weather_bins = {
        'Clear': ['Clear', 'Fair'],
        'Cloudy': ['Cloudy', 'Mostly Cloudy', 'Partly Cloudy', 'Scattered Clouds', 'Overcast'],
        'Rainy': ['Light Rain', 'Rain', 'Light Freezing Drizzle', 'Light Drizzle', 'Heavy Rain', 'Light Freezing Rain', 'Drizzle', 'Light Freezing Fog', 'Light Rain Showers', 'Showers in the Vicinity', 'T-Storm', 'Thunder', 'Patches of Fog', 'Heavy T-Storm', 'Heavy Thunderstorms and Rain', 'Funnel Cloud', 'Heavy T-Storm / Windy', 'Heavy Thunderstorms and Snow', 'Rain / Windy', 'Heavy Rain / Windy', 'Squalls', 'Heavy Ice Pellets', 'Thunder / Windy', 'Drizzle and Fog', 'T-Storm / Windy', 'Smoke / Windy', 'Haze / Windy', 'Light Drizzle / Windy', 'Widespread Dust / Windy', 'Wintry Mix', 'Wintry Mix / Windy', 'Light Snow with Thunder', 'Fog / Windy', 'Snow and Thunder', 'Sleet / Windy', 'Heavy Freezing Rain / Windy', 'Squalls / Windy', 'Light Rain Shower / Windy', 'Snow and Thunder / Windy', 'Light Sleet / Windy', 'Sand / Dust Whirlwinds', 'Mist / Windy', 'Drizzle / Windy', 'Duststorm', 'Sand / Dust Whirls Nearby', 'Thunder and Hail', 'Freezing Rain / Windy', 'Light Snow Shower / Windy', 'Partial Fog', 'Thunder / Wintry Mix / Windy', 'Patches of Fog / Windy', 'Rain and Sleet', 'Light Snow Grains', 'Partial Fog / Windy', 'Sand / Dust Whirlwinds / Windy', 'Heavy Snow with Thunder', 'Heavy Blowing Snow', 'Low Drifting Snow', 'Light Hail', 'Light Thunderstorm', 'Heavy Freezing Drizzle', 'Light Blowing Snow', 'Thunderstorms and Snow', 'Heavy Rain Showers', 'Rain Shower / Windy', 'Sleet and Thunder', 'Heavy Sleet and Thunder', 'Drifting Snow / Windy', 'Shallow Fog / Windy', 'Thunder and Hail / Windy', 'Heavy Sleet / Windy', 'Sand / Windy', 'Heavy Rain Shower / Windy', 'Blowing Snow Nearby', 'Blowing Sand', 'Heavy Rain Shower', 'Drifting Snow', 'Heavy Thunderstorms with Small Hail'],
        'Harsh Conditions':['Light Snow', 'Snow', 'Light Snow / Windy', 'Snow Grains', 'Snow Showers', 'Snow / Windy', 'Light Snow and Sleet', 'Snow and Sleet', 'Light Snow and Sleet / Windy', 'Snow and Sleet / Windy','Blowing Dust / Windy', 'Fair / Windy', 'Mostly Cloudy / Windy', 'Light Rain / Windy', 'T-Storm / Windy', 'Blowing Snow / Windy', 'Freezing Rain / Windy', 'Light Snow and Sleet / Windy', 'Sleet and Thunder / Windy', 'Blowing Snow Nearby', 'Heavy Rain Shower / Windy','Hail','Volcanic Ash','Tornado']
        #'Snowy': ['Light Snow', 'Snow', 'Light Snow / Windy', 'Snow Grains', 'Snow Showers', 'Snow / Windy', 'Light Snow and Sleet', 'Snow and Sleet', 'Light Snow and Sleet / Windy', 'Snow and Sleet / Windy'],
        #'Windy': ['Blowing Dust / Windy', 'Fair / Windy', 'Mostly Cloudy / Windy', 'Light Rain / Windy', 'T-Storm / Windy', 'Blowing Snow / Windy', 'Freezing Rain / Windy', 'Light Snow and Sleet / Windy', 'Sleet and Thunder / Windy', 'Blowing Snow Nearby', 'Heavy Rain Shower / Windy'],
        #'Hail': ['Hail'],
        #'Volcanic Ash': ['Volcanic Ash'],
        #'Tornado': ['Tornado']

}

def map_weather_to_bins(weather):
    for bin_name, bin_values in weather_bins.items():
        if weather in bin_values:
            return bin_name
    return 'Other' 

df_11['Weather_Bin'] = df_11['Weather_Condition'].apply(map_weather_to_bins)
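The `map_weather_to_bins` function above scans every bin list for every row, so each of the ~7.7M rows pays a nested-loop cost. Inverting the dictionary once and using `Series.map` makes each lookup O(1); a sketch using a trimmed, hypothetical version of `weather_bins`:

```python
import pandas as pd

# Trimmed example of the notebook's weather_bins mapping
weather_bins = {
    'Clear': ['Clear', 'Fair'],
    'Cloudy': ['Cloudy', 'Overcast'],
    'Rainy': ['Light Rain', 'Heavy Rain'],
}

# Invert once: condition -> bin name, so each row is a dict lookup
condition_to_bin = {cond: bin_name
                    for bin_name, conds in weather_bins.items()
                    for cond in conds}

s = pd.Series(['Fair', 'Overcast', 'Heavy Rain', 'Tornado'])
# Unmapped conditions become NaN, which fillna turns into 'Other'
binned = s.map(condition_to_bin).fillna('Other')
print(binned.tolist())
```

The same two lines (`.map(...).fillna('Other')`) applied to `df_11['Weather_Condition']` would produce the `Weather_Bin` column much faster than `.apply`.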
In [58]:
# Checking the unique values and their counts in the 'Severity' column
WeatherBin_counts = df_11['Weather_Bin'].value_counts()

WeatherBin_counts
Out[58]:
Clear               3370412
Cloudy              3131679
Rainy                573883
Other                276572
Harsh Conditions     213519
Name: Weather_Bin, dtype: int64
In [59]:
# Filter the dataset for the specified weather conditions
df_clear = df_11[df_11['Weather_Bin'] == 'Clear']
df_cloudy = df_11[df_11['Weather_Bin'] == 'Cloudy']
df_harsh = df_11[df_11['Weather_Bin'] == 'Harsh Conditions']
df_rainy = df_11[df_11['Weather_Bin'] == 'Rainy']
df_other = df_11[df_11['Weather_Bin'] == 'Other']
In [60]:
len(df_clear)
len(df_cloudy)
len(df_harsh)
len(df_rainy)
len(df_other)
Out[60]:
3370412
Out[60]:
3131679
Out[60]:
213519
Out[60]:
573883
Out[60]:
276572
In [61]:
# Calculate the mean values for visibility and severity for each weather condition
clear_visibility_mean = df_clear['Visibility'].mean()
clear_severity_mean = df_clear['Severity'].mean()
print("clear_visibility_mean:")
print(clear_visibility_mean)
print("clear_severity_mean:")
print(clear_severity_mean)


cloudy_visibility_mean = df_cloudy['Visibility'].mean()
cloudy_severity_mean = df_cloudy['Severity'].mean()
print("cloudy_visibility_mean:")
print(cloudy_visibility_mean)
print("cloudy_severity_mean:")
print(cloudy_severity_mean)

harsh_visibility_mean = df_harsh['Visibility'].mean()
harsh_severity_mean = df_harsh['Severity'].mean()
print("harsh_visibility_mean:")
print(harsh_visibility_mean)
print("harsh_severity_mean:")
print(harsh_severity_mean)

rainy_visibility_mean = df_rainy['Visibility'].mean()
rainy_severity_mean = df_rainy['Severity'].mean()
print("rainy_visibility_mean:")
print(rainy_visibility_mean)
print("rainy_severity_mean:")
print(rainy_severity_mean)

other_visibility_mean = df_other['Visibility'].mean()
other_severity_mean = df_other['Severity'].mean()
print("other_visibility_mean:")
print(other_visibility_mean)
print("other_severity_mean:")
print(other_severity_mean)
clear_visibility_mean:
9.852139471376198
clear_severity_mean:
2.187872283863219
cloudy_visibility_mean:
9.579545553679031
cloudy_severity_mean:
2.2383638297539434
harsh_visibility_mean:
4.803740276041009
harsh_severity_mean:
2.218214772455847
rainy_visibility_mean:
6.001854785731588
rainy_severity_mean:
2.2487859023529184
other_visibility_mean:
4.567289277294882
other_severity_mean:
2.1923079704380775
In [62]:
# Calculate the total records and severity distribution percentages for each weather bin,
# computed in one loop instead of one block per bin
weather_dfs = {
    'Clear': df_clear,
    'Harsh Conditions': df_harsh,
    'Cloudy': df_cloudy,
    'Rainy': df_rainy,
    'Other': df_other,
}
severity_pcts = {
    name: d['Severity'].value_counts(normalize=True).reindex([1, 2, 3, 4], fill_value=0) * 100
    for name, d in weather_dfs.items()
}
In [63]:
# Collect the visibility/severity means and severity percentages for display
mean_values_and_percentages = {
    'Weather': list(weather_dfs),
    'Visibility Mean': [clear_visibility_mean, harsh_visibility_mean, cloudy_visibility_mean, rainy_visibility_mean, other_visibility_mean],
    'Severity Mean': [clear_severity_mean, harsh_severity_mean, cloudy_severity_mean, rainy_severity_mean, other_severity_mean],
    'Total Records': [len(d) for d in weather_dfs.values()],
    'Severity 1': [severity_pcts[w][1] for w in weather_dfs],
    'Severity 2': [severity_pcts[w][2] for w in weather_dfs],
    'Severity 3': [severity_pcts[w][3] for w in weather_dfs],
    'Severity 4': [severity_pcts[w][4] for w in weather_dfs],
}
In [64]:
# Convert the dictionary to a DataFrame for plotting
df_summary = pd.DataFrame(mean_values_and_percentages)
In [65]:
df_summary
Out[65]:
Weather Visibility Mean Severity Mean Total Records Severity 1 Severity 2 Severity 3 Severity 4
0 Clear 9.85 2.19 3370412 1.01 81.77 14.66 2.57
1 Harsh Conditions 4.80 2.22 213519 0.31 81.00 15.23 3.45
2 Cloudy 9.58 2.24 3131679 0.80 77.19 19.38 2.63
3 Rainy 6.00 2.25 573883 0.65 76.46 20.25 2.64
4 Other 4.57 2.19 276572 0.55 82.02 15.07 2.36
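The severity-percentage columns of the table above can also be produced in a single call with `pd.crosstab`, row-normalized so each weather bin sums to 100%. A sketch on a tiny hypothetical frame (not the accident data):

```python
import pandas as pd

# Toy data standing in for df_11[['Weather_Bin', 'Severity']]
df_demo = pd.DataFrame({
    'Weather_Bin': ['Clear', 'Clear', 'Clear', 'Rainy', 'Rainy'],
    'Severity':    [2, 2, 3, 2, 4],
})

# normalize='index' makes each row a distribution over severity levels
pct = pd.crosstab(df_demo['Weather_Bin'], df_demo['Severity'],
                  normalize='index') * 100
print(pct)
```

Applied to `df_11`, one `crosstab` call would replace the twenty-odd percentage variables computed cell by cell above.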
In [66]:
# Creating a heatmap for the calculated mean values and percentages
# We need to transpose the summary DataFrame to have Weather Conditions as columns for the heatmap
df_summary_transposed = df_summary.set_index('Weather').T



# Plotting the heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(df_summary_transposed, annot=True, fmt=".2f", cmap='coolwarm', linewidths=.3)
plt.title("Heatmap of Weather Conditions vs Visibility and Severity Metrics")
plt.ylabel("Metrics")
plt.show()
Out[66]:
[Heatmap figure: 'Heatmap of Weather Conditions vs Visibility and Severity Metrics']

Key Insights

  1. Harsh Conditions show a significantly lower mean visibility, confirming that such weather can substantially reduce visibility.
  2. Surprisingly, the mean severity varies little across weather conditions, suggesting that weather alone does not drastically change the average severity of an accident.

Managerial recommendation:

Impact on ADAS (Advanced Driving Assistance System):

In Harsh Conditions, where visibility is significantly reduced, ADAS features such as adaptive headlights, night-vision systems, and forward-collision warnings should adapt to these conditions, helping detect obstacles on the road and prevent accidents.

Impact of Severity on Distance (mi):¶

In [67]:
sum_distance_severity_1 = df_11.loc[df_11['Severity'] == 1, 'Distance'].sum()
count_distance_severity_1 = df_11.loc[df_11['Severity'] == 1, 'Distance'].count()

print("Sum of Distance for Severity 1:", sum_distance_severity_1)
print("Count of Distance for Severity 1:", count_distance_severity_1)
Distance_Sev1 = sum_distance_severity_1/count_distance_severity_1
Distance_Sev1
Sum of Distance for Severity 1: 7228.2770003972655
Count of Distance for Severity 1: 65007
Out[67]:
0.11119228699058971
In [68]:
sum_distance_severity_2 = df_11.loc[df_11['Severity'] == 2, 'Distance'].sum()
count_distance_severity_2 = df_11.loc[df_11['Severity'] == 2, 'Distance'].count()

print("Sum of Distance for Severity 2:", sum_distance_severity_2)
print("Count of Distance for Severity 2:", count_distance_severity_2)
Distance_Sev2 = sum_distance_severity_2/count_distance_severity_2
Distance_Sev2
Sum of Distance for Severity 2: 3357955.815013315
Count of Distance for Severity 2: 6011746
Out[68]:
0.5585658168214883
In [69]:
sum_distance_severity_3 = df_11.loc[df_11['Severity'] == 3, 'Distance'].sum()
count_distance_severity_3 = df_11.loc[df_11['Severity'] == 3, 'Distance'].count()

print("Sum of Distance for Severity 3:", sum_distance_severity_3)
print("Count of Distance for Severity 3:", count_distance_severity_3)
Distance_Sev3 = sum_distance_severity_3/count_distance_severity_3
Distance_Sev3
Sum of Distance for Severity 3: 546066.5290523764
Count of Distance for Severity 3: 1291377
Out[69]:
0.42285601265345163
In [70]:
sum_distance_severity_4 = df_11.loc[df_11['Severity'] == 4, 'Distance'].sum()
count_distance_severity_4 = df_11.loc[df_11['Severity'] == 4, 'Distance'].count()

print("Sum of Distance for Severity 4:", sum_distance_severity_4)
print("Count of Distance for Severity 4:", count_distance_severity_4)
Distance_Sev4 = sum_distance_severity_4/count_distance_severity_4
Distance_Sev4
Sum of Distance for Severity 4: 296317.4930206464
Count of Distance for Severity 4: 197935
Out[70]:
1.4970444490395656
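As a side note, the four sum/count cells above can be collapsed into a single `groupby`. A minimal sketch on hypothetical sample data (the real notebook would run this on the full `df_11`):

```python
import pandas as pd

# Hypothetical mini-sample; the 'Severity' and 'Distance' column names match the cells above
df_11 = pd.DataFrame({
    'Severity': [1, 1, 2, 3, 4, 4],
    'Distance': [0.10, 0.30, 0.56, 0.42, 1.50, 2.50],
})

# One groupby computes what the four per-severity sum/count cells compute separately
mean_distance_by_severity = df_11.groupby('Severity')['Distance'].mean()
print(mean_distance_by_severity)
```

Besides being shorter, this avoids repeating the same filter-sum-count pattern once per severity level.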
In [71]:
# Collect the per-severity mean distances computed in the cells above,
# paired with their corresponding severity levels (1, 2, 3, 4)
Distance_Sev_Means = [Distance_Sev1, Distance_Sev2, Distance_Sev3, Distance_Sev4]
severity_values = [1, 2, 3, 4]

# Create a DataFrame from the means and severity values
df_means = pd.DataFrame({'Severity': severity_values, 'Mean_Distance': Distance_Sev_Means})

# Define a color palette for each severity level
severity_palette = {1: 'green', 2: 'brown', 3: 'orange', 4: 'red'}

# Plot the bar plot
plt.figure(figsize=(12, 8))
sns.barplot(x='Severity', y='Mean_Distance', data=df_means, palette=severity_palette, width=0.2, dodge=False)
plt.title('Bar Plot of Mean Distance against Severity')
plt.xlabel('Severity')
plt.ylabel('Mean Distance')
plt.show()

Key Insight: Severity 1 accidents have the shortest mean distance (~0.11 mi), while Severity 4 accidents affect significantly longer stretches of road (~1.50 mi on average). This could indicate that high-severity accidents tend to involve larger areas, potentially due to factors such as more vehicles being involved.

Interesting finding: Severity 2 accidents, while not the most impactful in terms of traffic delay, affect a longer stretch of road than Severity 3 accidents. This could suggest several scenarios:

  1. Severity 2 accidents may involve incidents that, although not causing severe traffic delays, cover a larger area. This might be due to incidents that result in obstructions or hazards spread over a greater distance, causing moderate traffic slowdowns.

  2. Severity 3 accidents, while affecting a smaller area, could be more concentrated and cause significant delays, possibly due to the road being blocked or more intensive emergency services response required.

Corroborating interesting findings with managerial recommendations:¶

1) a) Insight:

- Remarkably, almost 34% of accidents occurred even though a traffic signal, crossing, or junction was present at the location ((Traffic Signal) 14.84 + (Crossing) 11.35 + (Junction) 7.35 = 33.54%).
- On detailed analysis of these factors, dividing the US states into 7 zones based on geographical conditions, it is evident from the graph that 68.47% of accidents in the South East region occurred even though these 3 factors (traffic signal, crossing, and junction) were present.

South East: (Traffic Signal) 31.27 + (Crossing) 28.70 + (Junction) 8.50 = 68.47%

b) Managerial Recommendation:

- The government should levy higher penalties for traffic-rule violations (especially in the South East region), so that people drive more cautiously and the accident count decreases.

2) a) Insight:

- Regions with signal bumps tend to have lower accident counts, indicating that bumps may contribute to road safety and accident prevention.

b) Managerial Recommendation:

- Get more bumps! Increasing the number of signal bumps in high-accident areas could effectively reduce accident rates. Companies involved in road safety solutions could see increased demand for signal bumps and related traffic-calming products.

3) a) Insight:

- It is surprising to see that the severity mean does not vary significantly across different weather conditions, indicating that weather might not drastically change the average severity of accidents.

b) Managerial Recommendation:

- Impact on ADAS (Advanced Driver Assistance Systems): in harsh conditions, where visibility is significantly reduced, ADAS features such as adaptive headlights, night vision systems, and forward-collision warnings should adapt to the reduced visibility, assisting in detecting obstacles on the road and helping prevent accidents.

  (Note: ADAS is typically found in modern and self-driving vehicles.)